Problem Statement¶

Objective¶

AllLife Bank aims to expand its base of personal loan customers by converting liability customers (depositors) into borrowers. A previous campaign achieved a conversion rate of over 9%, indicating potential growth opportunities in this area. The current task is to develop a predictive model to identify customer attributes that significantly influence loan purchases and to determine segments of customers with a higher probability of opting for personal loans. This model will assist in targeted marketing efforts, enhancing the effectiveness of future campaigns.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
  • Mortgage: Value of house mortgage, if any (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Analysis Conclusions¶

  • Both the pre-pruned and post-pruned decision trees exhibit good generalization, performing similarly on both training and test sets.
  • The pre-pruned decision tree has approximately 2% higher precision on the training set compared to the test set.
  • The post-pruned decision tree shows a higher precision score on the test set compared to the pre-pruned model.
  • Both pre-pruned and post-pruned models utilize the same features (Income, Education_2, CCAvg, Education_3, Family, and Age) with similar relative importance.
  • We will select the post-pruned model as the best for this problem for the following reasons:
    • It has a better precision score on the test set, which is crucial for minimizing false positives in our marketing campaign.
    • While depth is not a direct measure of performance, in this case, the post-pruned model's slightly higher depth compared to some pre-pruned iterations resulted in better test precision.

Business Recommendations¶

  • The bank's marketing team can deploy this model to identify which of their liability customers have a higher potential to purchase a personal loan.
  • Using the model's likelihood (propensity) scores, the bank can prioritize and tailor its marketing outreach; a sketch of this follows the list below.
  • A customer's income and education level are the most important contributors in the decision-making process.
  • Credit card spending habits and family size also play a role.
  • The bank can tailor its marketing strategies towards these target segments of customers.
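
A minimal sketch of such propensity-based targeting, assuming a fitted classifier (e.g., the post-pruned tree, here called model) and the one-hot-encoded feature matrix X built in section 6.2:

# Hypothetical usage sketch: rank liability customers by loan propensity
import pandas as pd

propensity = model.predict_proba(X)[:, 1]  # estimated P(Personal_Loan = 1)
ranked = pd.DataFrame({"loan_propensity": propensity}, index=X.index)
ranked = ranked.sort_values("loan_propensity", ascending=False)
top_targets = ranked.head(500)  # e.g., size the campaign to the top 500 customers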

1 - Import necessary libraries¶

We'll use a decision tree, a classification algorithm, to predict the categorical target variable, Personal_Loan.

Instruction: After installing the libraries, restart the runtime so that the specified package versions are loaded; any dependency warnings raised during installation can be ignored.

In [1]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
In [2]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Library to split data
from sklearn.model_selection import train_test_split

# To build classification model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To compute various classification metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

2 - Load the dataset¶

In [3]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [4]:
Loan = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Project-2/loan_modelling.csv")
In [5]:
# copy data
data = Loan.copy()

3 - Data Overview¶

3.1 View sample rows¶

In [6]:
data.head()
Out[6]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [7]:
data.tail()
Out[7]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1
In [8]:
data.sample(5)
Out[8]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
2533 2534 54 29 111 93023 1 1.1 2 0 1 0 0 1 0
32 33 53 28 41 94801 2 0.6 3 193 0 0 0 0 0
4260 4261 57 31 52 94105 1 1.4 1 0 0 0 0 1 0
4777 4778 32 8 30 94534 4 0.4 2 78 0 0 0 1 0
3359 3360 43 19 45 91773 3 0.6 2 0 0 0 0 0 0

3.2 Data Shape¶

In [9]:
data.shape
Out[9]:
(5000, 14)
  • The dataset has 5000 rows and 14 columns.

3.3 Data types¶

In [10]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
  • Numerical variables are - ID, Age, Experience, Income, Family, CCAvg, and Mortgage.
  • Categorical variables are - although ZIPCode, Education, Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard are read in as numerical types, they are categorical variables that are numerically encoded by default.

3.4 Statistical Summary¶

In [11]:
data.describe().T
Out[11]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
  • The average age of customers is approximately 45 years, with ages ranging from 23 to 67 years.
  • The average professional experience of customers is 20 years. The minimum value of -3 for experience is an anomaly and does not make sense in this context.
  • The average income of customers is ~$74K; the mean exceeds the median ($64K), suggesting a slightly right-skewed distribution.
  • At least 75% of the customers have 3 or fewer people in the family.
  • Average monthly credit card spending is ~$1,940, with some customers spending as much as $10K, which is unusually high compared to the majority of the data. This suggests the presence of outliers.
  • 50% of the customers have an education level of Graduate or less.
  • ~69% of the customers have no mortgage; among the rest, values run as high as $635K.
  • The majority of customers did not take a personal loan in the last campaign, and most have neither a securities account nor a certificate of deposit account.
  • Approximately 59.68% of the customers use internet banking facilities.
  • Approximately 29.40% of the customers use a credit card issued by another bank.
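
As a quick numeric check of the skew observations above, pandas' skew() can be used (a sketch; values well above 0 indicate right skew):

# Skewness of the numerical features flagged as right-skewed above
data[["Income", "CCAvg", "Mortgage"]].skew()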
In [12]:
# Count the number of rows where 'Mortgage' is 0
data.loc[data['Mortgage'] == 0]['Mortgage'].value_counts()
Out[12]:
Mortgage
0 3462

In [13]:
data.loc[data['Experience'] == -3]['Experience'].value_counts()
Out[13]:
Experience
-3 4

In [14]:
data['Personal_Loan'].value_counts()
Out[14]:
Personal_Loan
0 4520
1 480

In [15]:
data['Securities_Account'].value_counts()
Out[15]:
Securities_Account
0 4478
1 522

In [16]:
data['CD_Account'].value_counts()
Out[16]:
CD_Account
0 4698
1 302

3.5 Check duplicates and missing values¶

In [17]:
data.duplicated().sum()
Out[17]:
0
  • There are no duplicate entries in the data.
In [18]:
data.isna().sum()
Out[18]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0

  • There are no missing values in the dataset.

3.6 Dropping columns¶

In [19]:
data = data.drop(['ID'], axis=1)

4 - Data Preprocessing - Stage 1¶

4.1 Treat Anomalous Values in the Experience column¶

In [20]:
data["Experience"].unique()
Out[20]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, -1, 34,  0, 38, 40, 33,  4, -2, 42, -3, 43])
  • Values of -1, -2 and -3 are anomalies
In [21]:
# check for experience < 0
data.loc[data["Experience"] < 0]["Experience"].unique()
Out[21]:
array([-1, -2, -3])
In [22]:
# Correcting the anomalous experience values by mapping them to their absolute values
data["Experience"] = data["Experience"].replace({-1: 1, -2: 2, -3: 3})
In [23]:
data["Education"].unique()
Out[23]:
array([1, 2, 3])

4.2 Feature Engineering¶

In [24]:
# check the number of unique values in the zip code
data["ZIPCode"].nunique()
Out[24]:
467
In [25]:
# Converts the data type of the "ZIPCode" to a string.
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
    "Number of unique values if we take first two digits of ZIPCode: ",
    data["ZIPCode"].str[0:2].nunique(),
)
Number of unique values if we take first two digits of ZIPCode:  7
In [26]:
data["ZIPCode"] = data["ZIPCode"].str[0:2]
In [27]:
data["ZIPCode"].unique()
Out[27]:
array(['91', '90', '94', '92', '93', '95', '96'], dtype=object)
In [28]:
data["ZIPCode"] = data["ZIPCode"].astype("category")
In [29]:
data["ZIPCode"].info()
<class 'pandas.core.series.Series'>
RangeIndex: 5000 entries, 0 to 4999
Series name: ZIPCode
Non-Null Count  Dtype   
--------------  -----   
5000 non-null   category
dtypes: category(1)
memory usage: 5.4 KB
In [30]:
# Convert the data type of categorical features to 'category'
cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")
In [31]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Age                 5000 non-null   int64   
 1   Experience          5000 non-null   int64   
 2   Income              5000 non-null   int64   
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64   
 5   CCAvg               5000 non-null   float64 
 6   Education           5000 non-null   category
 7   Mortgage            5000 non-null   int64   
 8   Personal_Loan       5000 non-null   category
 9   Securities_Account  5000 non-null   category
 10  CD_Account          5000 non-null   category
 11  Online              5000 non-null   category
 12  CreditCard          5000 non-null   category
dtypes: category(7), float64(1), int64(5)
memory usage: 269.8 KB

5 Exploratory Data Analysis (EDA)¶

5.1 Univariate Analysis¶

In [32]:
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )

    # creating the 2 subplots
    # boxplot will be created and a star will indicate the mean value of the column
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )
    # For histogram; use the specified bins if provided
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )
    # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )
    # Add median to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="orange", linestyle="-"
    )
In [33]:
# function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

5.1.1 Observations on Age¶

In [34]:
histogram_boxplot(data, "Age")
  • The average age of customers is approximately 45 years.
  • The age distribution appears symmetric.
  • There are no apparent outliers in the age data.

5.1.2 Observations on Experience¶

In [35]:
histogram_boxplot(data, "Experience")
  • The average experience of customers is 20 years
  • There are no outliers

5.1.3 Observations on Income¶

In [36]:
histogram_boxplot(data, "Income")
  • The average customer income is $73K
  • The income distribution is right-skewed, indicating the presence of outliers.

5.1.4 Observations on CCAvg¶

In [37]:
histogram_boxplot(data, "CCAvg")
  • The value distribution is right-skewed
  • There are outliers

5.1.5 Observations on Mortgage¶

In [38]:
histogram_boxplot(data, "Mortgage")
  • The mortgage value is heavily right-skewed with a high frequency of zero or very low values.
  • There are many outliers, indicated by the individual points plotted to the right of the right whisker, suggesting significantly higher mortgage values for some customers.
  • The long right whisker indicates a wide spread of values in the upper 25% of the data.

5.1.6 Observations on Family¶

In [39]:
labeled_barplot(data, "Family", perc=True)
  • At least 75% of the customers have 3 or fewer people in the family.

5.1.7 Observations on Education¶

In [40]:
labeled_barplot(data,"Education", perc=True)
  • 41% of customers have an undergraduate education level.

5.1.8 Observations on Securities_Account¶

In [41]:
labeled_barplot(data, "Securities_Account", perc=True)
  • Nearly 90% of customers do not have a securities account.

5.1.9 Observations on CD_Account¶

In [42]:
labeled_barplot(data, "CD_Account", perc=True)
  • Nearly 95% of customers do not have a certificate of deposit account.

5.1.10 Observations on Online¶

In [43]:
labeled_barplot(data, "Online", perc=True)
  • ~60% of customers use internet banking

5.1.11 Observation on CreditCard¶

In [44]:
labeled_barplot(data, "CreditCard", perc=True)
  • Approximately 30% of the customers use a credit card issued by another bank.

5.1.12 Observation on ZIPCode¶

In [45]:
labeled_barplot(data, "ZIPCode", perc=True)
  • ~30% of customers live in ZIP codes starting with "94".
  • Very few customers are from ZIP codes starting with "96".

5.1.13 Observations on Personal_Loan¶

In [46]:
labeled_barplot(data, "Personal_Loan", perc=True)
  • Only 9.6% of customers accepted the personal loan in the last campaign, so the target classes are imbalanced.

5.2 Bivariate Analysis¶

In [47]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]

    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    print(tab)
    print("-" * 120)
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))

    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), title=target)
    plt.show()
In [48]:
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title(f"Distribution of {predictor} for target={str(target_uniq[0])} (not opted for {target})")
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title(f"Distribution of {predictor} for target={str(target_uniq[1])} (opted for {target})")
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title(f"Boxplot of {predictor} w.r.t {target} with outliers")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 0],
        palette="gist_rainbow"
    )

    axs[1, 1].set_title(f"Boxplot of {predictor} w.r.t {target} without outliers")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

5.2.1 Correlation check¶

In [49]:
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • Age and Experience are heavily correlated: the higher the age, the higher the experience.
  • The correlation between Income and CCAvg is positive.
In [50]:
# scatter plot matrix (sns.pairplot creates its own figure, so no plt.figure call is needed)
sns.pairplot(data, hue="Personal_Loan", diag_kind="kde");
  • The strong correlation between Age and Experience is confirmed by the pair plot.
  • Customers with higher income and higher credit card spending seem to have accepted the personal loan in the last campaign.

5.2.2 Personal_Loan vs Education¶

In [51]:
stacked_barplot(data, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
Education                        
3              0.863424  0.136576
2              0.870278  0.129722
1              0.955630  0.044370
------------------------------------------------------------------------------------------------------------------------
  • Customers with a higher education level (graduate and above) are more likely to opt for a loan.

5.2.3 Personal_Loan vs Family¶

In [52]:
stacked_barplot(data, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
Family                           
3              0.868317  0.131683
4              0.890344  0.109656
2              0.918210  0.081790
1              0.927310  0.072690
------------------------------------------------------------------------------------------------------------------------
  • Larger families (3 or more members) appear more likely to opt for loans.

5.2.4 Personal_Loan vs Securities_Account¶

In [53]:
stacked_barplot(data, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
Personal_Loan              0         1
Securities_Account                    
1                   0.885057  0.114943
0                   0.906208  0.093792
------------------------------------------------------------------------------------------------------------------------
  • Having a securities account does not seem to have a strong influence on loan purchases.

5.2.5 Personal_Loan vs CD_Account¶

In [54]:
stacked_barplot(data, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
CD_Account                       
1              0.536424  0.463576
0              0.927629  0.072371
------------------------------------------------------------------------------------------------------------------------
  • A significantly higher percentage (46.36%) of customers with a Certificate of Deposit (CD) account purchased a personal loan in the last campaign compared to those without a CD account (7.24%).

5.2.6 Personal_Loan vs Online¶

In [55]:
stacked_barplot(data, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
Personal_Loan        0        1
Online                         
1              0.90248  0.09752
0              0.90625  0.09375
------------------------------------------------------------------------------------------------------------------------
  • Use of internet banking does not seem to influence loan purchases.

5.2.7 Personal_Loan vs CreditCard¶

In [56]:
stacked_barplot(data, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
CreditCard                       
1              0.902721  0.097279
0              0.904533  0.095467
------------------------------------------------------------------------------------------------------------------------
  • The use of a credit card from another bank does not appear to significantly influence whether a customer purchases a personal loan.

5.2.8 Personal_Loan vs ZIPCode¶

In [57]:
stacked_barplot(data, "ZIPCode", "Personal_Loan")
Personal_Loan     0    1   All
ZIPCode                       
All            4520  480  5000
94             1334  138  1472
92              894   94   988
95              735   80   815
90              636   67   703
91              510   55   565
93              374   43   417
96               37    3    40
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
ZIPCode                          
93             0.896882  0.103118
95             0.901840  0.098160
91             0.902655  0.097345
90             0.904694  0.095306
92             0.904858  0.095142
94             0.906250  0.093750
96             0.925000  0.075000
------------------------------------------------------------------------------------------------------------------------
  • ZIP code does not seem to influence loan purchases.

5.2.9 Personal_Loan vs Age¶

In [58]:
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
  • Based on the distribution plots, age does not appear to significantly influence personal loan purchases.

5.2.10 Personal Loan vs Experience¶

In [59]:
distribution_plot_wrt_target(data, "Experience", "Personal_Loan")
  • Based on the distribution plots, the distributions of Age and Experience look quite similar.

5.2.11 Personal Loan vs Income¶

In [60]:
distribution_plot_wrt_target(data, "Income", "Personal_Loan")
  • Based on the distribution plots, higher income seems to be associated with a higher likelihood of opting for a personal loan.

5.2.12 Personal Loan vs CCAvg¶

In [61]:
distribution_plot_wrt_target(data, "CCAvg", "Personal_Loan")
  • Based on the distribution plots, higher credit card spending seems to be associated with a higher likelihood of opting for a personal loan.

5.2.13 Personal Loan vs Mortgage¶

In [62]:
distribution_plot_wrt_target(data, "Mortgage", "Personal_Loan")
  • Mortgage value does not seem to influence personal loan purchases significantly.

5.3 EDA observations¶

  • Based on the exploratory data analysis, the features that may influence personal loan purchases are Income, CCAvg, Education, Family, and CD_Account.
  • Let's proceed to build a decision tree and evaluate further.

6 Data Preprocessing - Stage 2¶

6.1 Outlier Detection¶

Let's find the percentage of outliers in each column of the data using the IQR method.

In [63]:
# To find the 25th percentile
Q1 = data.select_dtypes(include=["float64", "int64"]).quantile(0.25)

# To find the 75th percentile
Q3 = data.select_dtypes(include=["float64", "int64"]).quantile(0.75)
print("type(Q1) = ", type(Q1))
print(Q1)
type(Q1) =  <class 'pandas.core.series.Series'>
Age           35.0
Experience    10.0
Income        39.0
Family         1.0
CCAvg          0.7
Mortgage       0.0
Name: 0.25, dtype: float64
In [64]:
# Compute the Interquartile Range (75th percentile - 25th percentile)
IQR = Q3 - Q1

# Finding lower and upper bounds for all numerical features. All values outside these bounds are outliers
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
In [65]:
# Calculate the percentage of outliers for each numerical column
(
    (data.select_dtypes(include=["float64", "int64"]) < lower_whisker)
    | (data.select_dtypes(include=["float64", "int64"]) > upper_whisker)
).sum() / len(data) * 100
Out[65]:
0
Age 0.00
Experience 0.00
Income 1.92
Family 0.00
CCAvg 6.48
Mortgage 5.82

  • Income has 1.92% outliers, CCAvg has 6.48%, and Mortgage has 5.82%.
  • Based on EDA, Income and CCAvg appear to influence loan purchase, while Mortgage does not.
  • Since the percentage of outliers in Income and CCAvg is relatively small, and their distributions are continuous according to the box plots in sections 5.1.3 and 5.1.4, these entries will be retained for model building.
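
For reference, had we chosen to treat these outliers instead, a common approach is to cap (winsorize) values at the IQR whisker bounds computed above. A sketch, reusing lower_whisker and upper_whisker from In [64]:

# Sketch only (not applied in this notebook): cap numerical columns at the IQR whiskers
num_cols = ["Income", "CCAvg", "Mortgage"]
data_capped = data.copy()
data_capped[num_cols] = data_capped[num_cols].clip(
    lower=lower_whisker[num_cols], upper=upper_whisker[num_cols], axis=1
)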

6.2 Data Preparation for Modeling¶

In [66]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Age                 5000 non-null   int64   
 1   Experience          5000 non-null   int64   
 2   Income              5000 non-null   int64   
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64   
 5   CCAvg               5000 non-null   float64 
 6   Education           5000 non-null   category
 7   Mortgage            5000 non-null   int64   
 8   Personal_Loan       5000 non-null   category
 9   Securities_Account  5000 non-null   category
 10  CD_Account          5000 non-null   category
 11  Online              5000 non-null   category
 12  CreditCard          5000 non-null   category
dtypes: category(7), float64(1), int64(5)
memory usage: 269.8 KB
In [67]:
# Drop the Experience column as it is highly correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]

# ZIPCode and Education are one hot encoded.
# Other categorical features do not need one-hot encoding as they have 0 or 1 values.
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)

X = X.astype(float)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
In [68]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 17)
Shape of test set :  (1500, 17)
Percentage of classes in training set:
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64

7 Model Building¶

7.1 Model Evaluation Criterion¶

The objective is to build a model that helps the marketing department identify potential customers with a higher probability of purchasing a personal loan.

The model can make two kinds of wrong predictions:

  • FP - The model predicts that a customer will purchase a loan (1), but in reality the customer won't (0)
  • FN - The model predicts that a customer will not purchase a loan (0), but in reality the customer will (1)

Which case is more important?

  • When we predict that a customer will purchase a loan, the marketing department spends significant effort, money, and time on them. If that customer ultimately doesn't purchase, this spend is a loss for the company.
  • On the other hand, if the model predicts a customer won't buy but they end up purchasing anyway, the company has spent nothing on them, so there is no loss.

How to reduce the losses?

Therefore, our predictions need high precision. Since precision is the fraction of predicted buyers who actually buy, maximizing it directly minimizes false positive (FP) predictions.
$\text{Precision} = \frac{\text{TP}}{\text{TP + FP}}$
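
As a quick illustrative check with toy labels (not from our data):

from sklearn.metrics import precision_score

y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [1, 0, 1, 1, 0]  # 2 TP, 1 FP, 1 FN
print(precision_score(y_true_toy, y_pred_toy))  # 2 / (2 + 1) = 0.667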

First, let's create functions to compute the different metrics and plot the confusion matrix, so that we don't have to repeat the same code for each model.

  • The model_performance_classification_sklearn function will be used to check model performance.
  • The confusion_matrix_sklearn function will be used to plot the confusion matrix.
In [69]:
# Define a utility function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [70]:
# Define a utility function to plot confusion matrix
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

7.2 Decision Tree (sklearn default)¶

In [71]:
# Create an instance of the decision tree model
model_default = DecisionTreeClassifier(criterion="gini", random_state=1)

# Fit the model to the training data
model_default.fit(X_train, y_train)
Out[71]:
DecisionTreeClassifier(random_state=1)

7.2.1 Check model performance for default model on training data¶

In [72]:
confusion_matrix_sklearn(model_default, X_train, y_train)
In [73]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model_default, X_train, y_train
)
decision_tree_perf_train
Out[73]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

7.2.2 Visualizing the Decision Tree for default model on training data¶

In [74]:
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']
In [75]:
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model_default,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# Add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [76]:
# Text report showing the rules of a decision tree
print(
    tree.export_text(
        model_default, # specify the model
        feature_names=feature_names, # specify the feature names
        show_weights=True # specify whether or not to show the weights associated with the model
        )
    )
|--- Income <= 104.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2519.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |--- weights: [61.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Age <= 30.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  30.00
|   |   |   |   |   |   |   |   |--- Age <= 45.00
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  45.00
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- CCAvg >  3.70
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [25.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- Age <= 40.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Age >  40.50
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- CCAvg <= 4.45
|   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |--- Age <= 61.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.35
|   |   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  4.35
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  61.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |--- Age <= 61.00
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |--- Age >  61.00
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- CCAvg <= 3.85
|   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.85
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- CCAvg >  4.45
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  57.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|--- Income >  104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.85
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |--- Income >  116.50
|   |   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- Age <= 33.00
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Age >  33.00
|   |   |   |   |   |   |--- CCAvg <= 3.27
|   |   |   |   |   |   |   |--- Age <= 50.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Age >  50.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.27
|   |   |   |   |   |   |   |--- Age <= 50.00
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  50.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 67.00] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 114.50
|   |   |   |--- Age <= 28.50
|   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |--- Age >  28.50
|   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 1.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg >  1.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.90
|   |   |   |   |   |   |   |--- CCAvg <= 4.30
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  4.30
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- Age <= 34.00
|   |   |   |   |   |   |   |--- CCAvg <= 2.15
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  2.15
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Age >  34.00
|   |   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |--- Age >  60.00
|   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |--- Income >  114.50
|   |   |   |--- weights: [0.00, 155.00] class: 1

In [77]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        model_default.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.357039
Family              0.207239
Education_2         0.163788
Education_3         0.146424
CCAvg               0.059631
Age                 0.052700
CD_Account          0.005728
Online              0.004393
Securities_Account  0.003057
ZIPCode_91          0.000000
ZIPCode_92          0.000000
ZIPCode_93          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Mortgage            0.000000
CreditCard          0.000000
In [78]:
importances = model_default.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

7.2.3 Check model performance for default model on test data¶

In [79]:
confusion_matrix_sklearn(model_default, X_test, y_test)
In [80]:
decision_tree_perf_test = model_performance_classification_sklearn(model_default, X_test, y_test)
decision_tree_perf_test
Out[80]:
Accuracy Recall Precision F1
0 0.981333 0.861111 0.939394 0.898551
In [81]:
print(decision_tree_perf_train)
print("\n")
print(decision_tree_perf_test)
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0


   Accuracy    Recall  Precision        F1
0  0.981333  0.861111   0.939394  0.898551
  • The default decision tree model exhibits overfitting. It achieves perfect predictions on the training data but performs less effectively on unseen test data.
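
A quick way to see the complexity behind this overfitting is to inspect the fitted tree's structure (a sketch using the same tree_ attributes employed in section 7.4):

# The unconstrained tree grows deep and wide enough to memorize the training set
print("Depth of tree:", model_default.tree_.max_depth)
print("Number of nodes:", model_default.tree_.node_count)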

7.3 Pre-pruning - performance improvement¶

In [82]:
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]

# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=42
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # Calculate precision scores for training and test sets
            train_precision_score = precision_score(y_train, y_train_pred)
            test_precision_score = precision_score(y_test, y_test_pred)

            # Calculate the absolute difference between training and test precision scores
            score_diff = abs(train_precision_score - test_precision_score)

            # Update the best estimator if the current one has a smaller score difference and a higher test precision
            if (score_diff < best_score_diff) and (test_precision_score > best_test_score):
                best_score_diff = score_diff
                best_test_score = test_precision_score
                best_estimator = estimator

# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test precision score: {best_test_score}")
Best parameters found:
Max depth: 4
Max leaf nodes: 50
Min samples split: 10
Best test precision score: 0.7595628415300546
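
As an aside, a similar search could be written with sklearn's GridSearchCV (a sketch; unlike the loop above, it selects parameters by cross-validated precision on the training folds, so the test set plays no role in tuning):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": np.arange(2, 11, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="precision",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)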
In [83]:
# Refit the best estimator found above to the training data
estimator = best_estimator
estimator.fit(X_train, y_train)
Out[83]:
DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=50,
                       min_samples_split=10, random_state=42)

7.3.1 Check performance on training data¶

In [84]:
confusion_matrix_sklearn(estimator, X_train, y_train)
In [85]:
decision_tree_pre_tune_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_pre_tune_train
Out[85]:
Accuracy Recall Precision F1
0 0.97 0.979167 0.770492 0.862385

7.3.2 Visualize the Decision Tree¶

In [86]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [87]:
# Text report showing the rules of a decision tree

print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- ZIPCode_93 <= 0.50
|   |   |   |--- weights: [1243.36, 0.00] class: 0
|   |   |--- ZIPCode_93 >  0.50
|   |   |   |--- weights: [119.47, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 81.50
|   |   |   |--- Age <= 36.50
|   |   |   |   |--- weights: [7.19, 15.62] class: 1
|   |   |   |--- Age >  36.50
|   |   |   |   |--- weights: [33.74, 0.00] class: 0
|   |   |--- Income >  81.50
|   |   |   |--- CCAvg <= 4.40
|   |   |   |   |--- weights: [21.02, 78.12] class: 1
|   |   |   |--- CCAvg >  4.40
|   |   |   |   |--- weights: [8.85, 0.00] class: 0
|--- Income >  98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [266.59, 15.62] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [9.40, 322.92] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- weights: [9.96, 52.08] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 348.96] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 112.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- weights: [23.23, 20.83] class: 0
|   |   |   |--- CCAvg >  2.75
|   |   |   |   |--- weights: [3.87, 52.08] class: 1
|   |   |--- Income >  112.50
|   |   |   |--- Age <= 25.50
|   |   |   |   |--- weights: [0.55, 0.00] class: 0
|   |   |   |--- Age >  25.50
|   |   |   |   |--- weights: [2.77, 843.75] class: 1

In [88]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                             Imp
Income              6.673960e-01
Education_2         1.594027e-01
CCAvg               7.199187e-02
Education_3         5.530637e-02
Family              3.727906e-02
Age                 8.623986e-03
ZIPCode_93          1.505678e-15
CD_Account          0.000000e+00
Online              0.000000e+00
Securities_Account  0.000000e+00
ZIPCode_91          0.000000e+00
ZIPCode_92          0.000000e+00
ZIPCode_94          0.000000e+00
ZIPCode_95          0.000000e+00
ZIPCode_96          0.000000e+00
Mortgage            0.000000e+00
CreditCard          0.000000e+00
In [89]:
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

7.3.3 Checking performance on test data¶

In [90]:
confusion_matrix_sklearn(estimator, X_test, y_test)
In [91]:
decision_tree_pre_tune_test = model_performance_classification_sklearn(estimator, X_test, y_test)
decision_tree_pre_tune_test
Out[91]:
Accuracy Recall Precision F1
0 0.967333 0.965278 0.759563 0.850153
  • Although the pre-pruned model has lower precision than the default model, the smaller gap in precision between the training and test data indicates that it generalizes better.

7.4 Post-pruning - performance improvement¶

In [92]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [93]:
pd.DataFrame(path)
Out[93]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000250 0.000500
2 0.000257 0.001014
3 0.000276 0.001566
4 0.000286 0.002137
5 0.000343 0.002480
6 0.000400 0.003680
7 0.000429 0.004109
8 0.000429 0.004537
9 0.000457 0.004995
10 0.000467 0.005461
11 0.000470 0.009222
12 0.000484 0.010189
13 0.000488 0.010677
14 0.000495 0.011667
15 0.000508 0.012175
16 0.000583 0.012758
17 0.000595 0.013354
18 0.000667 0.016023
19 0.000938 0.016961
20 0.000989 0.017950
21 0.000994 0.018944
22 0.001076 0.021097
23 0.001625 0.022723
24 0.001782 0.024505
25 0.001908 0.026413
26 0.002335 0.028748
27 0.002970 0.031718
28 0.008156 0.039874
29 0.025722 0.091318
30 0.034690 0.126007
31 0.047561 0.173568
In [94]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [95]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04756053380018527

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [96]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
[Figure: number of nodes vs alpha and depth of tree vs alpha]

Precision vs alpha for training and testing sets

In [97]:
precision_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = precision_score(y_train, pred_train)
    precision_train.append(values_train)

precision_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = precision_score(y_test, pred_test)
    precision_test.append(values_test)
In [98]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Precision")
ax.set_title("Precision vs alpha for training and testing sets")
ax.plot(ccp_alphas, precision_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, precision_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
[Figure: precision vs alpha for the training and testing sets]
In [99]:
index_best_model = np.argmax(precision_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0010764119601328905, random_state=1)
In [100]:
best_alpha = best_model.ccp_alpha
print(best_alpha)
0.0010764119601328905
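
Note that choosing alpha by maximizing test-set precision effectively tunes the model on the test data. A more conservative alternative (a sketch, not what this notebook does) is to cross-validate over the candidate alphas on the training set only:

from sklearn.model_selection import GridSearchCV

# sketch: pick ccp_alpha by 5-fold cross-validated precision on the training
# data, keeping the test set untouched for the final evaluation
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"ccp_alpha": ccp_alphas},
    scoring="precision",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_["ccp_alpha"])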
In [101]:
estimator_2 = DecisionTreeClassifier(
    # 'balanced' weights classes inversely to their frequency in y_train
    ccp_alpha=best_alpha, class_weight="balanced", random_state=1
)
estimator_2.fit(X_train, y_train)
Out[101]:
DecisionTreeClassifier(ccp_alpha=0.0010764119601328905, class_weight='balanced',
                       random_state=1)
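
The class_weight="balanced" setting counters the class imbalance by weighting each class inversely to its frequency; scikit-learn computes the weights as n_samples / (n_classes * np.bincount(y)). A quick sketch of the implied weights:

# 'balanced' assigns each class the weight n_samples / (n_classes * class_count),
# so the rare positive class carries proportionally more weight
counts = np.bincount(y_train)
implied_weights = len(y_train) / (2 * counts)
print(dict(enumerate(implied_weights)))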

7.4.1 Checking performance on training data¶

In [102]:
confusion_matrix_sklearn(estimator_2, X_train, y_train)
[Figure: confusion matrix on the training data (post-pruned model)]
In [103]:
decision_tree_post_tune_train = model_performance_classification_sklearn(estimator_2, X_train, y_train)
decision_tree_post_tune_train
Out[103]:
   Accuracy  Recall  Precision        F1
0  0.982286     1.0   0.844221  0.915531

7.4.2 Visualizing the Decision Tree¶

In [104]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# the code below adds arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
[Figure: visualization of the post-pruned decision tree]
In [105]:
# Text report showing the rules of the decision tree

print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1362.83, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 81.50
|   |   |   |--- Age <= 36.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [1.11, 15.62] class: 1
|   |   |   |   |--- Family >  3.50
|   |   |   |   |   |--- weights: [6.08, 0.00] class: 0
|   |   |   |--- Age >  36.50
|   |   |   |   |--- weights: [33.74, 0.00] class: 0
|   |   |--- Income >  81.50
|   |   |   |--- CCAvg <= 4.40
|   |   |   |   |--- Age <= 46.00
|   |   |   |   |   |--- Income <= 90.50
|   |   |   |   |   |   |--- weights: [7.74, 0.00] class: 0
|   |   |   |   |   |--- Income >  90.50
|   |   |   |   |   |   |--- weights: [2.21, 10.42] class: 1
|   |   |   |   |--- Age >  46.00
|   |   |   |   |   |--- weights: [11.06, 67.71] class: 1
|   |   |   |--- CCAvg >  4.40
|   |   |   |   |--- weights: [8.85, 0.00] class: 0
|--- Income >  98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 101.50
|   |   |   |   |   |--- CCAvg <= 2.95
|   |   |   |   |   |   |--- weights: [2.77, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.95
|   |   |   |   |   |   |--- weights: [0.55, 15.62] class: 1
|   |   |   |   |--- Income >  101.50
|   |   |   |   |   |--- weights: [263.27, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- weights: [4.42, 0.00] class: 0
|   |   |   |   |--- Income >  103.50
|   |   |   |   |   |--- weights: [4.98, 322.92] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [3.87, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- weights: [6.08, 52.08] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 348.96] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 112.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [14.93, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Income <= 111.50
|   |   |   |   |   |   |--- weights: [3.87, 20.83] class: 1
|   |   |   |   |   |--- Income >  111.50
|   |   |   |   |   |   |--- weights: [4.42, 0.00] class: 0
|   |   |   |--- CCAvg >  2.75
|   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |--- weights: [1.11, 52.08] class: 1
|   |   |   |   |--- Age >  59.50
|   |   |   |   |   |--- weights: [2.77, 0.00] class: 0
|   |   |--- Income >  112.50
|   |   |   |--- weights: [3.32, 843.75] class: 1
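
Each branch of this report reads as a plain predicate, which makes the target segments easy to communicate to the marketing team. As an illustration (the function name is hypothetical; the thresholds are taken from the printout above):

# one high-confidence branch from the rules above, written as a plain predicate;
# income is in thousand dollars, education_2/education_3 are the one-hot flags
def graduate_high_income_segment(income, family, education_2, education_3):
    return (
        income > 98.5
        and family <= 2.5
        and education_3 <= 0.5
        and education_2 > 0.5
        and income > 103.5
    )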

In [106]:
# Importance of features in the tree building: the importance of a feature is
# computed as the (normalized) total reduction of the criterion brought by
# that feature. It is also known as the Gini importance.

print(
    pd.DataFrame(
        estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.667899
Education_2         0.151815
CCAvg               0.074623
Education_3         0.052674
Family              0.040114
Age                 0.012875
CD_Account          0.000000
Online              0.000000
Securities_Account  0.000000
ZIPCode_91          0.000000
ZIPCode_92          0.000000
ZIPCode_93          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Mortgage            0.000000
CreditCard          0.000000
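
As a quick sanity check, these Gini importances are normalized and therefore sum to 1:

# the normalized importances sum to 1 by construction
print(estimator_2.feature_importances_.sum())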
In [107]:
importances = estimator_2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
[Figure: feature importance bar chart for the post-pruned model]

7.4.3 Checking performance on test data¶

In [108]:
confusion_matrix_sklearn(estimator_2, X_test, y_test)
[Figure: confusion matrix on the test data (post-pruned model)]
In [109]:
decision_tree_post_tune_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
decision_tree_post_tune_test
Out[109]:
   Accuracy    Recall  Precision        F1
0  0.971333  0.944444   0.795322  0.863492
  • The precision score has increased in the post-pruned model relative to the pre-pruned model.

8 Model Performance Comparison and Final Model Selection¶

In [110]:
# training performance comparison

models_train_comp_df = pd.concat(
    [decision_tree_perf_train, decision_tree_pre_tune_train, decision_tree_post_tune_train], axis=0,
)
models_train_comp_df.index = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning)", "Decision Tree (Post-Pruning)"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[110]:
                                 Accuracy    Recall  Precision        F1
Decision Tree (sklearn default)  1.000000  1.000000   1.000000  1.000000
Decision Tree (Pre-Pruning)      0.970000  0.979167   0.770492  0.862385
Decision Tree (Post-Pruning)     0.982286  1.000000   0.844221  0.915531
In [111]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [decision_tree_perf_test, decision_tree_pre_tune_test, decision_tree_post_tune_test], axis=0,
)
models_test_comp_df.index = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning)", "Decision Tree (Post-Pruning)"]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[111]:
                                 Accuracy    Recall  Precision        F1
Decision Tree (sklearn default)  0.981333  0.861111   0.939394  0.898551
Decision Tree (Pre-Pruning)      0.967333  0.965278   0.759563  0.850153
Decision Tree (Post-Pruning)     0.971333  0.944444   0.795322  0.863492

8.1 Compare feature importance side by side¶

In [112]:
fig, axes = plt.subplots(1, 3, figsize=(24, 8))

# plot feature importances for the default, pre-pruned, and post-pruned models
models = [
    (model_default, "Default Model"),
    (estimator, "Pre-Pruned Model"),
    (estimator_2, "Post-Pruned Model"),
]
for ax, (model, label) in zip(axes, models):
    importances = model.feature_importances_
    indices = np.argsort(importances)
    ax.set_title("Feature Importances ({})".format(label))
    ax.barh(range(len(indices)), importances[indices], color="violet", align="center")
    ax.set_yticks(range(len(indices)))
    ax.set_yticklabels([feature_names[i] for i in indices])
    ax.set_xlabel("Relative Importance")

plt.tight_layout()
plt.show()
[Figure: feature importance bar charts for the default, pre-pruned, and post-pruned models]

8.2 Visualize trees side by side¶

In [113]:
fig, axes = plt.subplots(1, 3, figsize=(30, 15))  # 1 row, 3 columns

# visualize the default, pre-pruned, and post-pruned trees side by side
models = [
    (model_default, "Default Model"),
    (estimator, "Pre-Pruned Model"),
    (estimator_2, "Post-Pruned Model"),
]
for ax, (model, label) in zip(axes, models):
    out = tree.plot_tree(
        model,
        feature_names=feature_names,
        filled=True,
        fontsize=9,
        node_ids=False,
        class_names=None,
        ax=ax,  # specify the subplot to plot on
    )
    ax.set_title("Decision Tree ({})".format(label))
    # add arrows to the splits if they are missing
    for o in out:
        arrow = o.arrow_patch
        if arrow is not None:
            arrow.set_edgecolor("black")
            arrow.set_linewidth(1)

plt.tight_layout()  # adjust layout to prevent overlapping titles/labels
plt.show()
[Figure: the default, pre-pruned, and post-pruned decision trees side by side]

9 Actionable Insights and Business Recommendations¶

9.1 Observations derived from performance metrics¶

  • Both the pre-pruned and post-pruned decision trees exhibit good generalization, performing similarly on both training and test sets.
  • The pre-pruned decision tree's precision is about 1 percentage point higher on the training set than on the test set (quantified in the quick check below).
  • The post-pruned decision tree shows a higher precision score on the test set compared to the pre-pruned model.
  • Both pre-pruned and post-pruned models utilize the same features (Income, Education_2, CCAvg, Education_3, Family, and Age) with similar relative importance.
  • We will select the post-pruned model as the best for this problem for the following reasons:
    • It has a better precision score on the test set, which is crucial for minimizing false positives in our marketing campaign.
    • While depth is not a direct measure of performance, in this case, the post-pruned model's slightly higher depth compared to some pre-pruned iterations resulted in better test precision.
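
These gaps can be read directly off the comparison tables in section 8; a quick check, using the models_train_comp_df and models_test_comp_df DataFrames built above:

# train-test precision gap per model; a smaller gap means better generalization
precision_gap = models_train_comp_df["Precision"] - models_test_comp_df["Precision"]
print(precision_gap)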

9.2 Business recommendations to the bank¶

In [114]:
X_test.iloc[:1, :]
Out[114]:
Age Income Family CCAvg Mortgage Securities_Account CD_Account Online CreditCard ZIPCode_91 ZIPCode_92 ZIPCode_93 ZIPCode_94 ZIPCode_95 ZIPCode_96 Education_2 Education_3
805 55.0 132.0 3.0 5.9 307.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
In [115]:
%%time
# choosing a data point
test_customer = X_test.iloc[:1, :]

# making a prediction
loan_potential = estimator_2.predict(test_customer)

print(loan_potential)
[1]
CPU times: user 3.3 ms, sys: 0 ns, total: 3.3 ms
Wall time: 3.1 ms
In [116]:
# making a prediction
loan_purchase_likelihood = estimator_2.predict_proba(test_customer)

print(loan_purchase_likelihood[0, 1])
0.9960822722820765
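
For a decision tree, predict_proba is the (class-weighted) fraction of training samples in the leaf that the customer falls into. This customer (Income 132, Family 3) reaches the Income > 112.5 leaf in the rules report above, whose weights are roughly [3.32, 843.75]:

# sanity check: the predicted probability is the weighted class fraction of the
# reached leaf (weights as printed above, rounded to two decimals)
print(843.75 / (3.32 + 843.75))  # ~0.9961, matching the model output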
  • This indicates that the model assigns a ~99% likelihood that test_customer will purchase a loan.
  • The bank's marketing team can deploy this model to identify which of its liability customers have a higher potential to purchase a loan.
  • Using the likelihood score, the bank can prioritize its marketing targets (see the sketch below).
  • A customer's income and education are the most important contributors in the decision-making process.
  • Credit card spending habits and family size also play a role.
  • The bank can tailor its marketing strategies towards this target segment of customers.
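
A minimal deployment sketch for the targeting idea above: score a pool of customers and rank them by purchase likelihood. Here X_test stands in for the bank's liability-customer base; any DataFrame with the training columns would work.

# score every customer and rank by likelihood of purchasing a loan, so the
# campaign can contact the highest-likelihood segment first
likelihoods = pd.Series(
    estimator_2.predict_proba(X_test)[:, 1],
    index=X_test.index,
    name="loan_likelihood",
)
print(likelihoods.sort_values(ascending=False).head(10))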

In [ ]:
!jupyter nbconvert --to html "/content/drive/MyDrive/Colab Notebooks/Project-2/loan_purchase_modelling_dt.ipynb"